NLG Evaluation: Let’s open up the box
Abstract
There is a spectrum of possible shared tasks that can be used to compare NLG systems and from which we can learn. A lot depends on how we set up the rules of these games. We argue that the most useful games are not necessarily the easiest ones to play.

The Lure of End-to-End Evaluation

Mellish and Dale (1998) discuss a number of different approaches to NLG system evaluation that had been used by 1998. Systems can be evaluated, for instance, in terms of accuracy, in terms of fluency, or in terms of their ability to support a human task. Independent of this is the question of whether evaluation is black box or glass box, that is, whether it results in an assessment only of the complete system or also of its contributing parts. End-to-end evaluation is black-box evaluation of complete NLG systems: systems are presented with “naturally occurring” data and the language they produce is evaluated (for accuracy, fluency, etc.). End-to-end evaluation is a tempting way to start doing NLG evaluation, because it imposes minimal constraints on the structure of the systems, so as many people as possible can take part. This is important, because at the beginning a critical mass is needed for things to “take off”.

The Dangers of End-to-End Evaluation

Unfortunately, there are dangers in using an end-to-end task as the basis of comparative NLG system evaluation:

• Danger of overfitting the task. The best systems may have little to say about language in general, but may encode elaborate stimulus-response type structures that work for this task only.

• Lack of generalisability. The best systems may have nothing to say about other NLG tasks. Or the way that systems are presented and compared may prevent researchers in nearby areas from seeing the relevance of the techniques. So you may actually end up attracting fewer interested people.
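To make the black-box, end-to-end setup above concrete, the following is a minimal sketch (not from the original paper) of such an evaluation harness in Python. The system under test is treated as an opaque function from input data to text, and only its outputs are scored, here against human reference texts with corpus BLEU from NLTK. The generate function, the toy input, and the reference sentences are hypothetical placeholders.

from typing import Callable, List
from nltk.translate.bleu_score import corpus_bleu

def end_to_end_bleu(generate: Callable[[str], str],
                    inputs: List[str],
                    references: List[List[str]]) -> float:
    # Black-box evaluation: the system's internals are never inspected;
    # only the texts it produces for each input are scored.
    hypotheses = [generate(x).split() for x in inputs]
    ref_tokens = [[r.split() for r in refs] for refs in references]
    return corpus_bleu(ref_tokens, hypotheses)

# Hypothetical usage with a trivial stand-in "system":
if __name__ == "__main__":
    system = lambda _: "the temperature will rise"  # placeholder NLG system
    score = end_to_end_bleu(
        system,
        inputs=["forecast: temperature increasing"],
        references=[["the temperature will rise",
                     "temperatures are set to rise"]],
    )
    print(f"corpus BLEU: {score:.3f}")

A glass-box evaluation would, by contrast, also inspect and score intermediate results produced inside the system (e.g. content plans or realiser inputs), which a harness like this deliberately cannot see.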